Overview

The purpose of this document is to provide an overview of how a Computerized Adaptive Test works, and simulate a simple CAT.

Prepare workspace

Load packages and create function to return item coefficients.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

IRT-Based CAT

Steps in implementing a CAT (adapted from Magis, Yan, von Davier, 2017)

1. Item Bank.

Before we can run a computerized adaptive test, we require an item bank, typically an IRT-calibrated bank. For this demonstration, we will simulate an IRT-calibrated item bank for dichotomously scored items (responses are scored as 0 or 1).

2. Initial Step.

The first step is to determine which item(s) to select first and deliver it. Here are a few options:

  1. Item with the greatest information around the population mean.
  2. Item with difficulty closest to an initial ability level (Urry’s rule).
  3. Using prior information (if available), and selecting item with most info close to the expected ability.
  4. A group of 2-3 items.

However, for test security purposes we should not start with the same item(s) for everyone. If we don’t use prior information and everyone starts from the same assumed initial ability of 0 (\({\theta}_{0} = 0\)), then in an IRT-calibrated item bank it is likely that only one item has the greatest information (option 1) or the closest difficulty (option 2) at that ability level, and that same item would start every CAT. This is referred to as initial item selection bias, or a cold start problem: every CAT session starts with the same item.

If we don’t have any prior information (which we’ll assume for the purpose of this project), you’ll want to introduce some variability into your initial item selection mechanism.

For purposes of this simulation, we’ll use option 2 (Urry’s rule) with some variability built into the initial item selection.
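
One simple way to add that variability is to pick the starting item at random from the few items closest to the starting ability. The sketch below assumes this approach; `pick_initial_item`, its `k` argument, and the difficulty values are hypothetical and are not the `initial_item` function used later in this document.

```r
# Hypothetical helper: pick the starting item at random from the k items
# whose difficulty is closest to the starting ability theta0.
pick_initial_item <- function(b, theta0 = 0, k = 5) {
  k <- min(k, length(b))
  nearest <- order(abs(b - theta0))[seq_len(k)]  # indices of the k closest items
  nearest[sample.int(k, 1)]                      # draw one of them uniformly
}

set.seed(1)
b_demo <- c(-2.0, -0.3, -0.1, 0.05, 0.2, 0.4, 1.5)  # made-up difficulties
starts <- replicate(200, pick_initial_item(b_demo, theta0 = 0, k = 3))
table(starts)  # only items 3, 4 and 5 (the three closest to 0) can appear
```

A larger `k` spreads exposure over more items, at the cost of sometimes starting further from the assumed ability.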

3. Test Step.

After the initial item has been selected, a CAT follows an iterative process to completion.

  1. Record the item response.
  2. Score the response.
  3. Update response pattern.
  4. Estimate ability.
  5. If stopping rule is not satisfied:
    1. Identify set of eligible items.
    2. Select the next item from eligible items.
    3. Circle back to Test Step ‘1’ and repeat the process.
  6. If the stopping rule is satisfied:
    1. The test stops.
    2. The final ability estimate is delivered.

But first, a few notes on some of the steps.

3.2. Score the response

In this CAT demonstration we are using an IRT-calibrated item bank for dichotomously scored items, though other item types could certainly be used, such as ordered polytomous items (items with multiple, ordered scoring levels). Those other models would need to be incorporated into the CAT, and the item-type specific parameters (e.g., ‘step’ or ‘threshold’ parameters for the Partial Credit model) would need to be captured in the item bank, with a way to distinguish, say, the Ordered Polytomous items from Dichotomous items for selection and scoring purposes.

3.4. Estimate Ability

The proficiency estimator can impact the precision and distribution of scores. Common methods are:

  1. Maximum Likelihood Estimation (MLE) - Selects the value of \({\theta}\) that maximizes the likelihood of the response string given the item parameters.
  2. Expected a posteriori (EAP) and Maximum a posteriori (MAP) - The value of \({\theta}\) is obtained with a Bayesian approach, which combines a prior distribution for ability with the likelihood of the responses to form a posterior distribution.
    1. EAP is the mean of the posterior proficiency distribution for a given individual
    2. MAP is the mode of the posterior proficiency distribution for a given individual

For this document, we’ll use MLE to estimate ability after each item.
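
Although we won’t use them here, the two Bayesian estimators are easy to compare on a grid. The sketch below assumes a 3PL model without the 1.7 scaling constant and a standard normal prior; `p3pl` and `est_bayes_grid` are hypothetical helpers written for illustration.

```r
# 3PL response probability (no 1.7 scaling constant; an assumption)
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

# Hypothetical helper: grid-based EAP and MAP with a standard normal prior
est_bayes_grid <- function(responses, as, bs, cs,
                           grid = seq(-4, 4, by = 0.01)) {
  # Likelihood of the whole response pattern at each grid point
  lik <- vapply(grid, function(th) {
    p <- p3pl(th, as, bs, cs)
    prod(p^responses * (1 - p)^(1 - responses))
  }, numeric(1))
  post <- lik * dnorm(grid)          # unnormalised posterior
  post <- post / sum(post)           # normalise over the grid
  list(eap = sum(grid * post),       # posterior mean
       map = grid[which.max(post)])  # posterior mode
}

# Two incorrect responses: the prior keeps both estimates finite
est <- est_bayes_grid(responses = c(0, 0), as = c(1, 1),
                      bs = c(0.5, -3), cs = c(0.1, 0.1))
est
```

Because the prior pulls estimates toward 0, both values stay inside the grid even when every response is incorrect.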

MLE

One drawback of MLE is that it is undefined for response patterns with zero variance (all 0’s or all 1’s), and this can become problematic at early stages in the CAT. For instance, if someone gets the first item wrong, all we know is that their ability level is probably lower than the difficulty parameter of the first item (or thereabouts). The MLE estimate will be an extremely low value, and if we’re basing item selection on the proximity of item difficulty to current ability, then the second item on the test will likely be the easiest item in the bank.

To demonstrate this, I’ve created a MLE estimation function est_ability_mle that will estimate ability based on the inputs:

  1. responses : a vector of item responses
  2. as : a vector of IRT discrimination a parameters
  3. bs : a vector of IRT difficulty b parameters
  4. cs : a vector of IRT pseudo-guessing c parameters

Let’s imagine we answered the first item incorrectly (response = 0). Item parameters are (1, .5, .1) for (a, b, c).

est_ability_mle(0, 1, .5, .1, kludge = FALSE)$ability_est
## [1] -4.09

A very low ability estimate (-4.09). If we base item selection on this estimate, the next item selected would be the easiest in the bank, perhaps in the neighborhood of -3 logits.

Let’s say we somehow miss the second question too:

est_ability_mle(
  c(0, 0),   # Responses are 0 and 0
  c(1, 1),   # Both 'a' params are 1
  c(.5, -3), # Item 1 difficulty = 0.5, Item 2 difficulty = -3
  c(.1, .1),  # Both 'c' params are 0.1
  kludge = FALSE
)$ability_est
## [1] -7.58

Again, the ability estimate is very low, and we’d witness similar extreme values if we got both items correct. Therefore, we need some way to adjust the MLE function so item selection early in the CAT isn’t subject to wild swings in ability estimate.
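
To see why zero-variance patterns cause trouble, here is a minimal sketch of an unadjusted MLE that searches a bounded interval. It assumes a 3PL model without the 1.7 scaling constant; `p3pl` and `mle_bounded` are hypothetical helpers, not the est_ability_mle function used above, so the values will not match the ones shown.

```r
# 3PL response probability (no 1.7 scaling constant; an assumption)
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

# Hypothetical helper: plain MLE by one-dimensional search on [lower, upper]
mle_bounded <- function(responses, as, bs, cs, lower = -6, upper = 6) {
  loglik <- function(theta) {
    p <- p3pl(theta, as, bs, cs)
    sum(responses * log(p) + (1 - responses) * log(1 - p))
  }
  optimize(loglik, interval = c(lower, upper), maximum = TRUE)$maximum
}

# All-incorrect pattern: the likelihood keeps rising as theta falls,
# so the "estimate" is just wherever the search interval ends
all_wrong_6  <- mle_bounded(c(0, 0), c(1, 1), c(0.5, -3), c(0.1, 0.1))
all_wrong_10 <- mle_bounded(c(0, 0), c(1, 1), c(0.5, -3), c(0.1, 0.1),
                            lower = -10)

# Mixed pattern: the likelihood has an interior maximum
mixed <- mle_bounded(c(0, 1), c(1, 1), c(0.5, -3), c(0.1, 0.1))
c(all_wrong_6 = all_wrong_6, all_wrong_10 = all_wrong_10, mixed = mixed)
```

Widening the interval moves the all-incorrect estimate with it, which is the sense in which the MLE is undefined for that pattern.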

MLE Adjustment Factor

One way we can do this is by using an adjustment factor, which adds (or subtracts) a value to all values in the response string when putting them into the likelihood function. Although our IRT model presumes response values of either 0 or 1, the likelihood function will take any numerical value for responses (whether or not they make sense).

Therefore, by adjusting the response strings with no variance by a little bit, we can restrict our MLE ability estimates until we get some variability in responses.

To do this, I created an MLE function (est_ability_mle_kludge) that applies an adjustment factor (kludge) of \(\frac{1}{2\sqrt{n}}\) to each item, where \(n\) is the number of items in the response vector. For instance, the response vector c(0,0,0,0) would have an adjustment factor of \(\frac{1}{2\sqrt{4}} = \frac{1}{4}\) and be converted to c(0.25, 0.25, 0.25, 0.25).
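
The adjustment itself can be sketched in a few lines. The code below is an illustration under stated assumptions (3PL without the 1.7 scaling constant, a bounded search on [-6, 6]); `adjust_responses` and `mle_adjusted` are hypothetical names, and because the details of the original function are not shown here, this sketch is not expected to reproduce the estimates in this document.

```r
# 3PL response probability (no 1.7 scaling constant; an assumption)
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

# Hypothetical helper: nudge zero-variance patterns by 1 / (2 * sqrt(n))
adjust_responses <- function(responses) {
  k <- 1 / (2 * sqrt(length(responses)))
  if (all(responses == 0)) return(responses + k)  # all incorrect: move up
  if (all(responses == 1)) return(responses - k)  # all correct: move down
  responses                                       # mixed: leave untouched
}

# Hypothetical helper: MLE on the (possibly fractional) adjusted responses
mle_adjusted <- function(responses, as, bs, cs, lower = -6, upper = 6) {
  x <- adjust_responses(responses)
  loglik <- function(theta) {
    p <- p3pl(theta, as, bs, cs)
    sum(x * log(p) + (1 - x) * log(1 - p))
  }
  optimize(loglik, interval = c(lower, upper), maximum = TRUE)$maximum
}

adjust_responses(c(0, 0, 0, 0))  # four values of 0.25
one_wrong <- mle_adjusted(0, 1, 0.5, 0.1)
one_wrong                        # finite, and well inside the interval
```

With a single incorrect response the adjusted value is 0.5, so the likelihood peaks where the response probability equals 0.5 rather than running off to the lower bound.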

By using this method, we can estimate ability after each of the first 4 items of a CAT. For simplicity, we’ll fix the a and c parameters to be equal for each run, and set the ‘b’ parameter for each item to be near the previous ability estimate.

# Note that default for est_ability_mle is 'kludge = TRUE'

est_ability_mle(0, 1, 0.5, .1)$ability_est
## [1] -0.55
# [1] 0.5498125; let's set 'b' for Item 2 to .55

est_ability_mle(rep(0,2), rep(1,2), c(0.5, .55), rep(.1, 2))$ability_est
## [1] -1.2
# [1] -1.20393; let's set 'b' for Item 3 to -1.20

est_ability_mle(rep(0,3), rep(1,3), c(0.5, .55, -1.20), rep(.1, 3))$ability_est
## [1] -2.88
# [1] -2.882119; let's set 'b' for Item 4 to -2.88

est_ability_mle(rep(0,4), rep(1,4), c(0.5, .55, -1.20, -2.88), rep(.1, 4))$ability_est
## [1] -5.05
# [1] -5.053933

As we can see, this approach really restricts the ability estimates when we have no variability in response patterns. Once the test-taker provides a response that introduces variability (e.g., they answer the 5th item correctly), this adjustment factor is ignored and the MLE estimate is based solely on the actual response pattern. Let’s see what happens if they get the next 3 items correct.

est_ability_mle(c(0,0,0,0,1), rep(1,5), c(0.5, .55, -1.20, -2.88, -5.05), rep(.1, 5))$ability_est
## [1] -4.24
est_ability_mle(c(0,0,0,0,1,1), rep(1,6), c(0.5, .55, -1.20, -2.88, -5.05, -4.24), rep(.1, 6))$ability_est
## [1] -3.57
est_ability_mle(c(0,0,0,0,1,1,1), rep(1,7), c(0.5, .55, -1.20, -2.88, -5.05, -4.24, -3.57), rep(.1, 7))$ability_est
## [1] -3.08

There we go, it’s coming back down to earth.

3.5.2. Item Selection

We have a few options for selecting the next items in a CAT:

  1. Maximum Fisher Information (MFI): Item with most information at the current ability estimate. \[ j_t^* = \arg \max_{j \in S_t} I_j(\hat{\theta}_{t-1}(X_{t-1})) \]

  2. bOpt Criterion, or Urry’s Rule: Item with the difficulty nearest the current ability estimate. Under the Rasch model this selects the same item as MFI; under the 2PL model it does so only when the eligible items share the same discrimination. \[ j_t^* = \arg \min_{j \in S_t} \left| \hat{\theta}_{t-1}(X_{t-1}) - b_j \right| \]

  3. Maximum Likelihood Weighted Information (MLWI): Weights the information by the likelihood function of the currently administered response pattern. Addresses the issue of MFI being severely biased in early stages of the CAT. \[ j_t^* = \arg \max_{j \in S_t} \int_{-\infty}^{+\infty} L(\theta | X_{t-1}) I_j(\theta) \, d\theta \]

  4. Maximum Posterior Weighted Information (MPWI) \[ j_t^* = \arg \max_{j \in S_t} \int_{-\infty}^{+\infty} f(\theta) L(\theta | X_{t-1}) I_j(\theta) \, d\theta \]

Legend of Terms:

  • \(j_t^*\): Selected item at step \(t\)

  • \(S_t\): Set of eligible items at step \(t\)

  • \(I_j(\theta)\): Item information function for item \(j\)

  • \(\hat{\theta}_{t-1}(X_{t-1})\): Current provisional ability estimate based on the current response pattern \(X_{t-1}\)

  • \(b_j\): Difficulty level of item \(j\)

  • \(L(\theta | X_{t-1})\): Likelihood function given response pattern \(X_{t-1}\)

  • \(f(\theta)\): Prior distribution of ability (e.g., standard normal distribution)

There are several other selection methods we could use as well. For our purpose, let’s just select the easiest one to implement now: bOpt Criterion, since the only values needed are the current ability estimate (\(\hat{\theta}_{t-1}(X_{t-1})\)) and eligible item locations (\(b_j\)).
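
To make the difference between the first two criteria concrete, the sketch below computes 3PL item information and applies both rules to a tiny, made-up bank. It assumes the 3PL model without the 1.7 scaling constant; `p3pl`, `info3pl`, and the bank values are hypothetical.

```r
# 3PL response probability (no 1.7 scaling constant; an assumption)
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

# 3PL item information at ability theta
info3pl <- function(theta, a, b, c) {
  p <- p3pl(theta, a, b, c)
  a^2 * ((1 - p) / p) * ((p - c) / (1 - c))^2
}

# Made-up bank of four eligible items
bank <- data.frame(item_id = 1:4,
                   a = c(0.6, 2.0, 1.0, 1.5),
                   b = c(0.10, 0.60, -0.50, 1.50),
                   c = c(0.2, 0.2, 0.2, 0.2))
theta_hat <- 0  # current provisional ability estimate

# bOpt / Urry's rule: difficulty closest to theta_hat
pick_bopt <- bank$item_id[which.min(abs(theta_hat - bank$b))]

# MFI: largest information at theta_hat
pick_mfi <- bank$item_id[which.max(info3pl(theta_hat, bank$a, bank$b, bank$c))]

c(bopt = pick_bopt, mfi = pick_mfi)  # bOpt picks item 1, MFI picks item 2
```

Item 1 sits closest to the current estimate but discriminates weakly, so MFI prefers the more discriminating item 2 even though it is further away.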

4. Stopping Step

This step sets the parameters for terminating a CAT. There are four main stopping rules that are commonly considered:

  1. Length
    • This sets the total number of items to be administered. Once this number is reached, the CAT terminates.
    • This can ensure everyone sees the same number of items (but at the cost of varying degrees of accuracy).
  2. Precision
    • Stops the CAT when the ability level reaches a predefined level of precision (e.g., a provisional ability estimate has a standard error smaller than some pre-set criterion.)
    • This has the benefit of efficiency, at the cost of different lengths of assessments.
  3. Classification
    • Used for testing skill mastery.
    • The main goal is to determine if the test taker has an ability greater or less than the level of mastery.
    • In practice, this mastery level is set, and a provisional confidence interval is built around the provisional ability estimate. If the confidence interval contains the mastery level, there’s not enough certainty about the classification and the test continues. If the confidence interval doesn’t contain the mastery level, a classification can be made confidently, and the test can terminate.
  4. Information
    • Focus is the information carried by the remaining items in the item bank.
    • The threshold is the minimum information carried by at least one of the eligible items. Condition for the CAT to continue is that remaining items have enough information to significantly increase the total information. If, at a provisional ability estimate, all eligible items have information values smaller than the threshold, the CAT stops.

For our demonstration, we’ll use both precision and length as stopping criteria: the test will stop once (a) the standard error of our ability estimate falls below a pre-defined cutoff, or (b) the test reaches a certain length (we want to limit the testing time), whichever comes first.
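
A combined precision-and-length rule reduces to a short logical check. The sketch below is one possible form; `should_stop` and its arguments are hypothetical and are not the stop_test function used later in this document.

```r
# Hypothetical helper: stop when the estimate is precise enough
# or when the maximum test length has been reached.
should_stop <- function(n_administered, se, max_items = 20, min_se = 0.5) {
  # Before any estimate exists, se may be NA: treat that as "not precise yet"
  precise_enough <- !is.na(se) && se < min_se
  precise_enough || n_administered >= max_items
}

should_stop(n_administered = 5,  se = 0.62)  # FALSE: imprecise, test is short
should_stop(n_administered = 12, se = 0.48)  # TRUE: precision reached
should_stop(n_administered = 20, se = 0.71)  # TRUE: length cap reached
```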

Basic CAT, Step-By-Step

In the overview of how a CAT works, we noted a few decisions for the current simulation. Let’s summarize them up front:

Structure

  1. Item Bank.
    • Our simulated item bank will include IRT-calibrated parameters for dichotomously scored items.
  2. Initial Step.
    • Without prior information (or making assumptions there), we’ll use Urry’s Rule and start selecting items with a difficulty near a current ability estimate (which we’ll set to 0).
    • To avoid overexposure of the item with b closest to 0, we’ll add some noise to this selection.
  3. Test Step.
    • We’ll record and score response patterns as normal.
    • We’ll use MLE for estimating ability
    • For response patterns with 0 variance, we’ll adjust the patterns with a factor of \(\frac{1}{2\sqrt{n}}\) to avoid wild swings in ability estimate early in the CAT.
  4. Stopping Step.
    • The test will stop once:
      1. the standard error of our ability estimate falls below a pre-defined cutoff.
      2. the test reaches a certain length, to limit total testing time.
    • Additionally, we may want the test

Now that we know how we will set up our simulation, let’s make it happen.

1. Generate Item Bank

Let’s simulate a 3PL item bank using the generate_item_bank function.

# Number of items
n_items <- 500

# Set seed (for random params)
set.seed(015)

# Rasch, 1pl, 2pl, or 3pl
item_type = "3pl"

# Generate an item bank
item_bank <- generate_item_bank(n_items, model = item_type)
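
The body of generate_item_bank is not shown in this document. As a rough idea of what such a generator might do, here is a minimal sketch; `sketch_item_bank` is a hypothetical stand-in, and the parameter distributions (normal difficulties, log-normal discriminations, beta-distributed guessing) are assumptions rather than the ones used for item_bank.

```r
# Hypothetical stand-in for an item bank generator; the parameter
# distributions below are assumptions chosen for illustration.
sketch_item_bank <- function(n_items, model = "3pl") {
  model <- tolower(model)
  stopifnot(model %in% c("rasch", "1pl", "2pl", "3pl"))
  # Difficulty: standard normal, so most items sit near average ability
  b_par <- rnorm(n_items, mean = 0, sd = 1)
  # Discrimination: varies only for 2PL/3PL, otherwise fixed at 1
  a_par <- if (model %in% c("2pl", "3pl")) {
    rlnorm(n_items, meanlog = 0, sdlog = 0.3)
  } else {
    rep(1, n_items)
  }
  # Pseudo-guessing: non-zero only for 3PL
  c_par <- if (model == "3pl") rbeta(n_items, 2, 10) else rep(0, n_items)
  data.frame(item_id = seq_len(n_items), a = a_par, b = b_par, c = c_par)
}

set.seed(15)
toy_bank <- sketch_item_bank(50, model = "3pl")
head(toy_bank, 3)
```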

1.a. Visualize Item Bank

And let’s visualize our item bank characteristics - IIFs, TIF, ICCs, and parameter distributions.

2. Initial Step

Given no prior information about the test-taker, let’s select an initial item for administration using the initial_item function, given the item_bank dataframe we created earlier. This function will create the test_event table and populate it with information from the first item selected.

set.seed(123)
(test_event <- initial_item(item_bank_df = item_bank))
##     order item_id    a      b     c response_score current_ability
## 383     1     383 1.24 0.0147 0.176             NA              NA
##     current_ability_se   item_selection_ts response response_ts
## 383                 NA 2024-07-18 20:07:20       NA        <NA>

The first item selected is item_id = 383.

3. Test Step

3.1 Score the response, estimate ability, and update test_event table

Use the score_response function to score this item. Let’s assume we get the item correct.

(test_event <- score_response(
  test_event_df = test_event, 
  item_id = test_event$item_id[nrow(test_event)],  # Item ID
  response = 1                                     # 1 = Correct, 0 = Incorrect
  ))
##     order item_id    a      b     c response_score current_ability
## 383     1     383 1.24 0.0147 0.176              1           0.325
##     current_ability_se   item_selection_ts response         response_ts
## 383               1.71 2024-07-18 20:07:20        1 2024-07-18 20:07:20

Given that we got the item correct, our new ability estimate is 0.325, with a standard error of the estimate of 1.705. By default this score_response function uses the MLE kludge we mentioned earlier. If we didn’t use that kludge, here’s what the test_event table would look like:

(score_response(
  test_event_df = test_event, 
  item_id = test_event$item_id[nrow(test_event)],  # Item ID
  response = 1,                                    # 1 = Correct, 0 = Incorrect
  kludge = FALSE
  ))
##     order item_id    a      b     c response_score current_ability
## 383     1     383 1.24 0.0147 0.176              1            3.89
##     current_ability_se   item_selection_ts response         response_ts
## 383               9.93 2024-07-18 20:07:20        1 2024-07-18 20:07:20

An ability estimate of 3.89, with a SE of 9.93… Yes, let’s use that kludge moving forward. Again it’s only going to affect item selection until there is variance in the response pattern (someone with a zero score gets an item correct, or someone with a perfect score misses an item).

3.2 Check stopping criteria

Next, we’ll check to see if our stopping criteria have been met. Since we haven’t set stopping criteria, let’s do that now.

  • Cap the number of items at 20
  • Set the standard error threshold to be 0.5 (stop if the value is less than 0.5)

Based on those, we’ll use the stop_test function to evaluate our test_event table against our criteria.

  • TRUE means the criteria have been met; stop the test
  • FALSE means the criteria have not been met; continue the test

stop_max_items <- 20
stop_min_se <- 0.5

stop_test(test_event_df = test_event,
          max_items = stop_max_items,
          min_se = stop_min_se)
## [1] FALSE

Don’t stop test. Keep moving.

3.3 Update the table of eligible items

eligible_items <- update_eligible_items(eligible_items_df = item_bank, 
                                        test_event_df = test_event)

paste("Of",nrow(item_bank),"items in the bank, ",nrow(eligible_items), "are eligible for selection.")
## [1] "Of 500 items in the bank,  499 are eligible for selection."

3.4 Select next item

The “Urry’s Rule” selection criterion selects the item with the difficulty parameter closest to the test-taker’s current ability estimate.

set.seed(123)
(test_event <- next_item(eligible_items_df = eligible_items, 
                        test_event_df = test_event))
##     order item_id    a      b      c response_score current_ability
## 383     1     383 1.24 0.0147 0.1762              1           0.325
## 316     2     316 1.40 0.3279 0.0207             NA              NA
##     current_ability_se   item_selection_ts response         response_ts
## 383               1.71 2024-07-18 20:07:20        1 2024-07-18 20:07:20
## 316                 NA 2024-07-18 20:07:20       NA                <NA>

At this point, we could just keep running this, changing our “answer” to 0 or 1, until the stopping criteria are met and the test ends.

Continue until stopping criteria are met

# Answer to the current question
# 0 = Incorrect, 1 = Correct
answer <- 0

# Score response
test_event <- score_response(test_event_df = test_event, 
                             item_id = test_event[nrow(test_event),"item_id"], 
                             response = answer)
# Check stopping criteria
if (!stop_test(test_event_df = test_event, 
               max_items = stop_max_items, 
               min_se = stop_min_se)) {
  # If the stopping criteria haven't been met, update the eligible items
  eligible_items <- update_eligible_items(eligible_items_df = eligible_items, 
                                          test_event_df = test_event)
  # and select the next item.
  (test_event <- next_item(eligible_items_df = eligible_items, 
                           test_event_df = test_event))
} else {
  # If the stopping criteria have been met, end the test.
  print(test_event)
  "The test is complete!"
}
##     order item_id    a       b      c response_score current_ability
## 383     1     383 1.24  0.0147 0.1762              1           0.325
## 316     2     316 1.40  0.3279 0.0207              0          -0.168
## 80      3      80 1.66 -0.1786 0.1252             NA              NA
##     current_ability_se   item_selection_ts response         response_ts
## 383               1.71 2024-07-18 20:07:20        1 2024-07-18 20:07:20
## 316               1.10 2024-07-18 20:07:20        0 2024-07-18 20:07:20
## 80                  NA 2024-07-18 20:07:20       NA                <NA>
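
Repeating that chunk by hand is the whole test loop. For readers without the helper functions, the loop can be sketched end to end as below. Everything here is a hypothetical, simplified stand-in: it assumes a 3PL model without the 1.7 scaling constant, a plain bounded MLE on [-4, 4] with no zero-variance adjustment, a standard error taken as one over the square root of the test information, and no noise in item selection.

```r
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

info3pl <- function(theta, a, b, c) {
  p <- p3pl(theta, a, b, c)
  a^2 * ((1 - p) / p) * ((p - c) / (1 - c))^2
}

# Hypothetical stand-in for the ability estimator: bounded MLE on [-4, 4]
mle_theta <- function(x, as, bs, cs) {
  loglik <- function(th) {
    p <- p3pl(th, as, bs, cs)
    sum(x * log(p) + (1 - x) * log(1 - p))
  }
  optimize(loglik, interval = c(-4, 4), maximum = TRUE)$maximum
}

run_cat <- function(bank, true_theta, max_items = 20, min_se = 0.5) {
  used <- integer(0)  # row indices of administered items
  x <- numeric(0)     # scored responses
  theta_hat <- 0      # starting ability
  repeat {
    # Eligible items: everything not yet administered
    eligible <- setdiff(seq_len(nrow(bank)), used)
    # bOpt / Urry's rule: difficulty closest to the current estimate
    nxt <- eligible[which.min(abs(bank$b[eligible] - theta_hat))]
    used <- c(used, nxt)
    # Simulate and score the response from the sim's true ability
    p_true <- p3pl(true_theta, bank$a[nxt], bank$b[nxt], bank$c[nxt])
    x <- c(x, rbinom(1, 1, p_true))
    # Re-estimate ability and its standard error
    theta_hat <- mle_theta(x, bank$a[used], bank$b[used], bank$c[used])
    test_info <- sum(info3pl(theta_hat, bank$a[used], bank$b[used], bank$c[used]))
    se <- 1 / sqrt(test_info)
    # Stopping rule: precise enough, or length cap reached
    if (se < min_se || length(used) >= max_items) break
  }
  list(theta_hat = theta_hat, se = se, n_items = length(used), items = used)
}

set.seed(42)
bank <- data.frame(a = runif(200, 0.8, 2), b = rnorm(200),
                   c = runif(200, 0.05, 0.25))  # made-up bank
res <- run_cat(bank, true_theta = 0.5)
res[c("theta_hat", "se", "n_items")]
```

Because this sketch omits the adjustment factor, its first few estimates can jump to the ends of the search interval, which is the behaviour the kludge is meant to dampen.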

Basic CAT, Simulation for n_people

Now that we have our CAT working, let’s set it up and simulate for a few hundred people to check how the CAT functions.

1. Item Bank

We’ll use the same item bank as in our example: item_bank

2. Simulate CAT

Let’s simulate our CAT for 500 people, using the same stopping criteria as before.

  • Max items = 20
  • Min SE = 0.50

# Number of test takers
n_people <- 500

# Define the seed outside of the function
seed = 123

# Stopping criteria, restated
stop_max_items <- 20
stop_min_se <- 0.5

# Create a set of true abilities for the simulated test takers
sample_abilities <- rnorm(n_people, 
                          mean = 0, 
                          sd = 1)

When running this simulation, however, the consistency of responses will impact how the CAT performs. For instance, we could simulate every respondent (‘sim’) answering exactly as expected based on their actual ability \({\theta}\) and the b-parameter of item j, \(b_{j}\). So if \(b_{j} < {\theta}\), answer correctly; if \({\theta} < b_{j}\), answer incorrectly. However, this type of highly consistent responding is not typical; real responses will have some degree of inconsistency.

To accommodate this in our simulation, the function simulate_cat includes a response_consistency argument that controls how consistent the simulated responses are. The function simulates each response as a draw from a binomial distribution whose success probability comes from the prob function for a given ability and an item’s a, b, and c parameters. The response_consistency value multiplies the ‘a’ parameter, which makes the item characteristic curve steeper, so a draw from the binomial distribution is more likely to agree with the sim’s ability, \({\theta}\). The default response_consistency is 1, which leaves the prob function unchanged; values above 1 produce more consistent responding, and values between 0 and 1 produce less consistent responding.
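
The mechanism described above can be sketched as follows. This is an illustration, not the code inside simulate_cat: `p3pl` and `simulate_response` are hypothetical helpers, and the 3PL form without the 1.7 scaling constant is an assumption.

```r
# 3PL response probability (no 1.7 scaling constant; an assumption)
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

# Hypothetical helper: draw 0/1 responses, scaling 'a' by response_consistency
simulate_response <- function(theta, a, b, c, response_consistency = 1) {
  p <- p3pl(theta, a * response_consistency, b, c)
  rbinom(length(p), size = 1, prob = p)
}

# Probability that a sim with theta = 0.5 answers an item with b = 0 correctly
# (a = 1, c = 0) at consistency 1 and 5: roughly 0.62 and 0.92
p3pl(0.5, a = 1 * c(1, 5), b = 0, c = 0)

# Responses to items of increasing difficulty at both consistency levels
set.seed(123)
b_demo <- seq(-2, 2, by = 0.5)
r1 <- simulate_response(0.5, a = 1, b = b_demo, c = 0, response_consistency = 1)
r5 <- simulate_response(0.5, a = 1, b = b_demo, c = 0, response_consistency = 5)
rbind(b = b_demo, consistency_1 = r1, consistency_5 = r5)
```

As response_consistency grows, the probabilities approach 0 or 1 on either side of the item difficulty, so the pattern approaches the deterministic "correct if \(b_{j} < {\theta}\)" rule.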

To demonstrate the difference in simulated response consistency, we’ll run the simulation for two different response_consistency levels: 1 and 5.

# Run the simulation with less consistent responding
test_cat_consistency1 <- simulate_cat(item_bank = item_bank, 
                           abilities = sample_abilities, 
                           seed = seed, 
                           max_items = stop_max_items,
                           min_se = stop_min_se,
                           response_consistency = 1, 
                           silent = TRUE)

# Run the simulation with very consistent responding
test_cat_consistency5 <- simulate_cat(item_bank = item_bank, 
                           abilities = sample_abilities, 
                           seed = seed, 
                           max_items = stop_max_items,
                           min_se = stop_min_se,
                           response_consistency = 5,
                           silent = TRUE)

3. Review CAT Summary Statistics

Inconsistent

##                              n    mean  median     sd    min   max trimmed
## ability                    500  0.0346  0.0207 0.9728 -2.661  3.24  0.0252
## n_items                    500 14.6940 15.0000 1.9432  9.000 20.00 14.6000
## final_ability              500  0.0739  0.1055 1.0764 -3.075  3.55  0.0737
## final_ability_se           500  0.4876  0.4889 0.0101  0.447  0.55  0.4884
## test_info_at_final_ability 500  3.0861  3.1366 0.3827  1.614  4.05  3.1104
## residual                   500 -0.0393 -0.0254 0.5611 -1.766  1.60 -0.0266
##                                mad  range     skew  kurtosis       se
## ability                    0.93579  5.902  0.08586 -0.058207 0.043504
## n_items                    1.48260 11.000  0.43368  0.182519 0.086901
## final_ability              1.03683  6.621  0.00851  0.138910 0.048136
## final_ability_se           0.00936  0.103 -0.21047  3.419723 0.000453
## test_info_at_final_ability 0.34905  2.439 -0.71383  0.815729 0.017115
## residual                   0.58545  3.366 -0.19926  0.000292 0.025093

Consistent

##                              n     mean  median      sd    min    max trimmed
## ability                    500  0.03459  0.0207 0.97277 -2.661  3.241  0.0252
## n_items                    500 14.12200 14.0000 1.53356  9.000 18.000 14.1200
## final_ability              500  0.03151 -0.0309 1.00152 -2.764  3.297  0.0149
## final_ability_se           500  0.48721  0.4890 0.00939  0.447  0.500  0.4882
## test_info_at_final_ability 500  3.09034  3.1347 0.32653  1.187  3.906  3.1169
## residual                   500  0.00308  0.0145 0.22817 -1.817  0.636  0.0104
##                                mad  range    skew kurtosis      se
## ability                    0.93579 5.9020  0.0859  -0.0582 0.04350
## n_items                    1.48260 9.0000 -0.0185  -0.0102 0.06858
## final_ability              0.96599 6.0610  0.1306  -0.0162 0.04479
## final_ability_se           0.00949 0.0534 -0.9362   0.7162 0.00042
## test_info_at_final_ability 0.26961 2.7191 -1.1699   3.1640 0.01460
## residual                   0.21059 2.4529 -1.1189   7.4580 0.01020

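Although the distributions of the final ability estimates are similar, the error for the inconsistent group is quite a bit larger: the residuals have a standard deviation of 0.561, compared with 0.228 for the consistent group. Let’s look at this visually.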
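
The residual rows in the tables above are consistent with residual = ability - final_ability. Two other common ways to condense recovery into a single number are bias and root mean squared error; the sketch below shows them on made-up values. `summarise_recovery` is a hypothetical helper and the vectors are toy data, not simulation results.

```r
# Hypothetical helper: summarise how well final estimates recover true ability
summarise_recovery <- function(ability, final_ability) {
  residual <- ability - final_ability
  c(bias = mean(residual),             # average signed error
    rmse = sqrt(mean(residual^2)),     # typical size of the error
    cor  = cor(ability, final_ability))
}

# Toy values only, for illustration
true_ability <- c(-1.2, -0.4, 0.0, 0.7, 1.5)
estimated    <- c(-1.0, -0.9, 0.3, 0.6, 1.9)
round(summarise_recovery(true_ability, estimated), 3)
```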

4. Visualizations

Ability Plots

Inconsistent

Consistent

Residual Plots

Inconsistent

Consistent

Ability Density

Inconsistent

Note: Density of actual abilities is in grey.

Consistent

Note: Density of actual abilities is in grey.

Test Length

Inconsistent

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

Consistent

Information by Ability

Inconsistent

Consistent

CAT Response Pattern and Ability Estimation Plots

And let’s visualize the CAT response pattern and ability estimates for a few cases.

Since we used the same true abilities in both simulations, and the only thing that changed between the two was response consistency, let’s see how that affected how the CAT operated.

Let’s pick the case with the ability closest to 0, case 7.

cat_test_plot(test_cat_consistency5, 7, "Consistent Group")

This plot shows a number of things related to this CAT administration:

  • The x-axis is the order of item administration, from 1 to \(n\).
  • The y-axis represents the latent theta scale, on which b-parameters and ability estimates are located.
    • Easier items and lower abilities are lower (more negative) on the Theta scale, harder items and higher ability are higher (more positive) on the scale.
  • Items are represented by the filled circles. In this case there are 13 circles, as the CAT stopped before item 14.
    • Green fill indicates the item was answered correctly.
    • Red fill indicates the item was answered incorrectly.
  • Ability estimates are depicted as black diamonds.
    • They are offset to represent the step between item administrations in which ability is estimated. For instance, after item 1 is administered and answered correctly, the ability estimate of this sim is 0.60. Item 2 was slightly harder, which they answered incorrectly, and their ability was estimated to be -0.38. The final ability estimate here is about 0.42.
  • The light blue band around the ability estimate is the standard error of the estimate.
  • The horizontal grey line represents the sim’s actual ability. This will be more evident in the next plots.

Let’s take a look at the CAT administrations for a few cases.

CAT Response Plot: Case 1

Inconsistent

cat_test_plot(test_cat_consistency1, 1, "Inconsistent Group")

Consistent

cat_test_plot(test_cat_consistency5, 1, "Consistent Group")

CAT Response Plot: Case 2

Inconsistent

cat_test_plot(test_cat_consistency1, 2, "Inconsistent Group")

Consistent

cat_test_plot(test_cat_consistency5, 2, "Consistent Group")

CAT Response Plot: Case 3

Inconsistent

cat_test_plot(test_cat_consistency1, 3, "Inconsistent Group")

Consistent

cat_test_plot(test_cat_consistency5, 3, "Consistent Group")

Summary

This document provided an overview of how a Computerized Adaptive Test works and demonstrated a simple CAT simulation. Key components covered include:

  1. Creating an item bank with IRT-calibrated parameters
  2. Implementing initial item selection strategies
  3. Estimating ability using Maximum Likelihood Estimation (MLE)
  4. Selecting subsequent items based on current ability estimates
  5. Applying stopping criteria to end the test

The document then simulated CAT administrations for 500 test-takers under two conditions: consistent and inconsistent responding. Visualizations were provided to compare the performance of the CAT under these conditions, including ability estimation accuracy, test length, and information at ability estimates. Overall, this simulation demonstrated how CATs can efficiently estimate test-taker abilities with fewer items than fixed-form tests, and showed the impact of response consistency on CAT performance. The concepts and code provided serve as a foundation for understanding and implementing basic CAT systems.